Abstract

It is well known that for any finite state Markov decision process (MDP) there is a memoryless 
deterministic policy that maximizes the expected reward. For partially observable Markov decision 
processes (POMDPs), optimal memoryless policies are generally stochastic. We study the expected 
reward optimization problem over the set of memoryless stochastic policies. We formulate this as 
a constrained linear optimization problem and develop a corresponding geometric framework. We 
show that any POMDP has an optimal memoryless policy of limited stochasticity, which allows 
us to reduce the dimensionality of the search space. Experiments demonstrate that this approach 
enables better and faster convergence of the policy gradient on the evaluated systems. 

Keywords: MDP, POMDP, partial observability, memoryless stochastic policy, average reward, 
policy gradient, reinforcement learning 


1. Introduction 


The field of reinforcement learning addresses a broad class of problems where an agent has to
learn how to act in order to maximize some form of cumulative reward. On choosing action $a$ at
some world state $w$ the world undergoes a transition to state $w'$ with probability $\alpha(w'|w,a)$ and
the agent receives a reward signal $R(w,a,w')$. A policy is a rule for selecting actions based on the
information that is available to the agent at each time step. In the simplest case, the Markov decision
process (MDP), the full world state is available to the agent at each time step. A key result in this
context shows the existence of optimal policies which are memoryless and deterministic (see Ross
1983). In other words, the agent performs optimally by choosing one specific action at each time
step based on the current world state. The agent does not need to take the history of world states
into account, nor does he need to randomize his actions.

In many cases one has to assume that the agent experiences the world only through noisy sensors
and the agent has to choose actions based only on the partial information provided by these sensors.
More precisely, if the world state is $w$, the agent only observes a sensor state $s$ with probability
$\beta(s|w)$. This setting is known as a partially observable Markov decision process (POMDP). Policy
optimization for POMDPs has been discussed by several authors (see Sondik 1978; Chrisman
1992; Littman et al. 1995; McCallum 1996; Parr and Russell 1995). Optimal policies generally
need to take the history of sensor states into account. This requires that the agent be equipped with
a memory that stores the sensor history or an encoding thereof (e.g., a properly updated belief state),
which may require additional computation.


Although in principle possible, in practice it is often too expensive to find or even to store
and execute completely general optimal policies. Some form of representation or approximation is
needed. In particular, in the context of embodied artificial intelligence and systems design (Pfeifer
and Bongard 2006) the on-board computation sets limits to the complexity of the controller with
respect to both memory and computational cost. We are interested in policies with limited memory
(see, e.g., Hansen 1998). In fact we will focus on memoryless stochastic policies (see Singh
et al. 1994; Jaakkola et al. 1995). Memoryless policies may be worse than policies with memory,
but they require far fewer parameters and computation. Among other approaches, the GPOMDP
algorithm (Baxter and Bartlett 2001) provides a gradient based method to optimize the expected
reward over parametric models of memoryless stochastic policies. For interesting systems, the set of
all memoryless stochastic policies can still be very high dimensional and it is important to find good
models. In this article we show that each POMDP has an optimal memoryless policy of limited
stochasticity, which allows us to construct low-dimensional differentiable policy models with
optimality guarantees. The amount of stochasticity can be bounded in terms of the amount of perceptual
aliasing, independently of the specific form of the reward signal.


We follow a geometric approach to memoryless policy optimization for POMDPs. The key idea
is that the objective function (the expected reward per time step) can be regarded as a linear function
over the set of stationary joint distributions over world states and actions. For MDPs this set is a
convex polytope and, in turn, there always exists an optimizer which is an extreme point. The
extreme points correspond to deterministic policies (which cannot be written as convex combinations
of other policies). For POMDPs this set is in general not convex, but it can be decomposed into
convex pieces. There exists an optimizer which is an extreme point of one of these pieces. Depending
on the dimension of the convex pieces, the optimizer is more or less stochastic.


This paper is organized as follows. In Section 2 we review basics on POMDPs. In Section 3 we
discuss the reward optimization problem in POMDPs as a constrained linear optimization problem
with two types of constraints. The first constraint is about the types of policies that can be
represented in the underlying MDP. The second constraint relates policies with stationary world state
distributions. We discuss the details of these constraints in Sections A and B. In Section 4 we use
these geometric descriptions to show that any POMDP has an optimal stationary policy of limited
stochasticity. In Section 5 we apply the stochasticity bound to define low dimensional policy models
with optimality guarantees. In Section 6 we present experiments which demonstrate the usefulness
of the proposed models. In Section 7 we offer our conclusions.

2. Partially observable Markov decision processes

A discrete time partially observable Markov decision process (POMDP) is defined by a tuple
$(W, S, A, \alpha, \beta, R)$, where $W$ is a finite set of world states, $S$ is a finite set of sensor states, $A$ is
a finite set of actions, $\beta\colon W \to \Delta_S$ is a Markov kernel that describes sensor state probabilities
given the world state, $\alpha\colon W \times A \to \Delta_W$ is a Markov kernel that describes the probability of
transitioning to a world state given the current world state and action, and $R\colon W \times A \to \mathbb{R}$ is a reward signal.
A Markov decision process (MDP) is the special case where $W = S$ and $\beta$ is the identity map.

A policy $\pi$ is a mechanism for selecting actions. In general, at each time step $t \in \mathbb{N}$, a policy is
defined by a Markov kernel $\pi_t$ taking the history $h_t = (s_0, a_0, \dots, s_t)$ of sensor states and actions
to a probability distribution $\pi_t(\cdot|h_t)$ over $A$. A policy is deterministic when at each time step each
possible history leads to a single positive probability action. A policy is memoryless when the
distribution over actions only depends on the current sensor state, $\pi_t(\cdot|h_t) = \pi_t(\cdot|s_t)$. A policy is
stationary (homogeneous) when it is memoryless and time independent, $\pi_t(\cdot|h_t) = \pi(\cdot|s_t)$ for all $t$.
Stationary policies are represented by kernels of the form $\pi\colon S \to \Delta_A$. We denote the set of all
such policies by $\Delta_{S,A}$.

The goal is to find a policy that maximizes some form of expected reward. We consider the long 
term expected reward per time step (also called average reward) 


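$$\mathcal{R}_\mu(\pi) = \lim_{T\to\infty} \mathbb{E}_{\Pr\{(w_t,a_t)_{t=0}^{T-1} \mid \pi,\mu\}}\left[\frac{1}{T}\sum_{t=0}^{T-1} R(w_t,a_t)\right]. \tag{1}$$

Here $\Pr\{(w_t,a_t)_{t=0}^{T-1} \mid \pi,\mu\}$ is the probability of the sequence $w_0, a_0, w_1, a_1, \dots, w_{T-1}, a_{T-1}$,
given that $w_0$ is distributed according to the start distribution $\mu \in \Delta_W$ and at each time step actions
are selected according to the policy $\pi$. Another option is to consider a discount factor $\gamma \in (0,1)$
and the discounted long term expected reward

$$\mathcal{R}^\gamma_\mu(\pi) = \lim_{T\to\infty} \mathbb{E}_{\Pr\{(w_t,a_t)_{t=0}^{T-1} \mid \pi,\mu\}}\left[\sum_{t=0}^{T-1} \gamma^t R(w_t,a_t)\right]. \tag{2}$$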


In the case of an MDP, it is always possible to find an optimal memoryless deterministic policy.
In other words, there is a policy that chooses an action deterministically at each time step,
depending only on the current world state, which achieves the same or higher long term expected
reward as any other policy. This fact can be regarded as a consequence of the policy improvement
theorem (Bellman 1957; Howard 1960).

In the case of a POMDP, policies with memory may perform much better than the memoryless
policies. Furthermore, within the memoryless policies, stochastic policies may perform much better
than the deterministic ones (see Singh et al. 1994). The intuitive reason is simple: Several world
states may produce the same sensor state with positive probability (perceptual aliasing). On the basis
of such a sensor state alone, the agent cannot discriminate the underlying world state with certainty.
On different world states the same action may lead to drastically different outcomes. Sometimes the
agent is forced to choose probabilistically between the optimal actions for the possibly underlying
world states (see Example 2). Sometimes he is forced to choose suboptimal actions in order to
minimize the risk of catastrophic outcomes (see Example 3). On the other hand, the sequence
of previous sensor states may help the agent identify the current world state and choose one single
optimal action. This illustrates why in POMDPs optimal policies may need to take the entire history
of sensor states into account and also why the optimal memoryless policies may require stochastic
action choices.

The set of policies that take the histories of sensor states and actions into account grows
extremely fast. A common approach is to transform the POMDP into a belief-state MDP, where the
discrete sensor state is replaced by a continuous Bayesian belief about the current world state. Such 
belief states encode the history of sensor states and allow for representations of optimal policies. 
However, belief states are associated with costly internal computations from the side of the acting 
agent. We are interested in agents subject to perceptual, computational, and storage limitations. 
Here we investigate stationary policies. 

We assume that for each stationary policy $\pi \in \Delta_{S,A}$ there is exactly one stationary world state
distribution $p^\pi(w) \in \Delta_W$ and that it is attained in the limit of infinite time when running policy $\pi$,
irrespective of the starting distribution $\mu$. This is a standard assumption that holds true, for instance,
whenever the transition kernel $\alpha$ is strictly positive. In this case (1) can be written as

$$\mathcal{R}(\pi) = \sum_w p^\pi(w) \sum_a p^\pi(a|w) R(w,a), \tag{3}$$

where $p^\pi(a|w) = \sum_s \pi(a|s)\beta(s|w)$. An optimal stationary policy is a policy $\pi^* \in \Delta_{S,A}$ with
$\mathcal{R}(\pi^*) \ge \mathcal{R}(\pi)$ for all $\pi \in \Delta_{S,A}$. Note that maximizing (3) over $\Delta_{S,A}$ is the same as
maximizing the discounted expected reward (2) over $\Delta_{S,A}$ with $\mu(w) = p^\pi(w)$ (see Singh et al. 1994).
The expected reward per time step appears more natural for POMDPs than the discounted expected
reward, because, assuming ergodicity, it is independent of the starting distribution, which is not
directly accessible to the agent. Our discussion focuses on average rewards, but our main Theorem 7
also covers discounted rewards.
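
The average reward (3) can be evaluated numerically once the kernels are known. The following is a minimal sketch, assuming that the induced world state chain has a unique stationary distribution (as assumed in the text above). The array layout (`beta[w, s]`, `alpha[w, a, w']`, `pi[s, a]`), the function names, and the small two-state system with its reward table are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P (assumed unique)."""
    n = P.shape[0]
    # Solve p (P - I) = 0 together with sum(p) = 1 in the least-squares sense.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

def average_reward(pi, beta, alpha, R):
    """Equation (3): sum_w p(w) sum_a p(a|w) R(w, a)."""
    p_a_w = beta @ pi                           # p(a|w) = sum_s beta(s|w) pi(a|s)
    P = np.einsum("wa,wav->wv", p_a_w, alpha)   # induced kernel p(w'|w)
    p_w = stationary_distribution(P)
    return float(np.sum(p_w[:, None] * p_a_w * R))

# Hypothetical system: 2 world states, 2 sensor states, 2 actions.
beta = np.full((2, 2), 0.5)                     # uninformative sensor
alpha = np.array([[[1.0, 0.0], [0.0, 1.0]],     # from w=0: action decides next state
                  [[0.5, 0.5], [0.5, 0.5]]])    # from w=1: uniform transition
R = np.array([[0.0, 1.0], [0.0, 0.0]])          # illustrative reward table R[w, a]
pi_second = np.array([[0.0, 1.0], [0.0, 1.0]])  # always choose the second action
print(average_reward(pi_second, beta, alpha, R))
```

For this policy the induced chain has stationary distribution (1/3, 2/3), so the printed value is 1/3 up to rounding.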

Our analysis is motivated by the following natural question: Given that every MDP has a
stationary deterministic optimal policy, does every POMDP have an optimal stationary policy with
small stochasticity? Bounding the required amount of stochasticity for a class of POMDPs would
allow us to define a policy model $\mathcal{M} \subseteq \Delta_{S,A}$ with

$$\max_{\pi \in \Delta_{S,A}} \mathcal{R}(\pi) = \max_{\pi \in \mathcal{M}} \mathcal{R}(\pi), \tag{4}$$

for every POMDP from that class. We will show that such a model $\mathcal{M}$ can be defined in terms of the
number of ambiguous sensor states and actions, such that $\mathcal{M}$ contains optimal stationary policies for
all POMDPs with that number of actions and ambiguous sensor states. Depending on this number,
$\mathcal{M}$ can be much smaller in dimension than the set of all stationary policies.

The following examples illustrate some cases where optimal stationary control requires
stochasticity and some of the intricacies involved in upper bounding the necessary amount of stochasticity.

Example 1. Consider a system with W = {1,..., n}, S = {1}, and A = {1,..., n}. The reward
function R(w, a) is +1 on a = w and -1 otherwise. The agent starts at some random state. On
state w = i action a = i takes the agent to some random state and all other actions leave the state
unchanged. In this case the best stationary policy chooses actions uniformly at random.

Example 2. Consider the grid world illustrated in Figure 1(a). The agent has four possible actions,
north, east, south, and west, which are effective when there is no wall in that direction. On reaching
cells 5, 11, and 13 the agent is teleported to cell 1. On 13 he receives a reward of one and otherwise
none. In an MDP setting, the agent knows its absolute position in the maze. A deterministic policy
can be easily constructed that leads to a maximal reward, as depicted in the upper right. In a POMDP
setting the agent may only sense the configuration of its immediate surrounding, as depicted in the
lower right. In this case any memoryless deterministic policy fails. Cells 3 and 9 look the same to
the agent. Always choosing the same action on this sensation will cause the agent to loop around,
never reaching the reward cell 13. Optimally, the agent should choose probabilistically between
east and west. The reader might want to have a look at the experiments treating this example in
Section 6.

Figure 1: (a) Illustration of the maze Example 2. The left part shows the configuration of world
states. The upper right shows an optimal deterministic policy in the MDP setting. At
each state, the policy action is in the black direction. The lower right shows the sensor
states as observed by the agent in each world state. (b) State transitions from Example 3.

Example 3. Consider the system illustrated in Figure 1(b). Each node corresponds to a world state
W = {0, 1, 2, 3}. The sensor states are S = {0, 1, 3}, whereby 1, 2 are sensed as 1. The actions
are A = {1, 2, 3}. Choosing action 1 in state 1 and action 2 in state 2 has a large negative reward.
Choosing action 2 in state 1 and action 1 in state 2 has a large positive reward. Choosing action 3
in 1, 2 has a moderate negative reward and takes the agent to state 3. From state 3 each action has a
large positive reward and takes the agent to 0. From state 0 any action takes the agent to 1 or 2 with
equal probability. In an MDP setting the optimal policy will choose action 2 on 1 and action 1 on 2.
In a POMDP setting the optimal policy chooses action 3 on 1. This shows that the optimal actions
in a POMDP do not necessarily correspond to the optimal actions in the underlying MDP. Similar
examples can be constructed where on a given sensor state it may be necessary to choose from a
large set of actions at random, larger than the set of actions that would be chosen on all possibly
underlying world states, were they directly observed.

3. Average reward maximization as a constrained linear optimization problem 

The expression $\sum_w p^\pi(w) \sum_a p^\pi(a|w) R(w,a)$ in the expected reward (3) is linear in
the joint distribution $p(w,a) = p(w)p(a|w) \in \Delta_{W \times A}$. We want to exploit this linearity. The
difficulty is that the optimization problem is with respect to the policy $\pi$, not the joint distribution,
and the stationary world state distribution $p^\pi(w)$ depends on the policy. This implies that not all
joint distributions $p(w,a)$ are feasible. The feasible set is delimited by the following two conditions.

• Representability in terms of the policy:

$$p(a|w) = \sum_s \pi(a|s)\beta(s|w), \quad \text{for some } \pi \in \Delta_{S,A}. \tag{5}$$

The geometric interpretation is that the conditional distribution $p(a|w)$ belongs to the
polytope $G \subseteq \Delta_{W,A}$ defined as the image of $\Delta_{S,A}$ by the linear map

$$f_\beta\colon \pi(a|s) \mapsto \sum_s \pi(a|s)\beta(s|w). \tag{6}$$

In turn, the joint distribution $p(w,a)$ belongs to the set $F \subseteq \Delta_{W \times A}$ of joint distributions with
conditionals $p(a|w)$ from the set $G$. In general the set $F$ is not convex, although it is convex
in the marginals $p(w)$ when fixing the conditionals $p(a|w)$, and vice versa. We discuss the
details of this constraint in Section A.


• Stationarity of the world state distribution:

$$\sum_a p(w,a)\alpha(w'|w,a) \in \Xi, \tag{7}$$

where $\Xi \subseteq \Delta_{W \times W}$ is the polytope of distributions $p(w,w')$ with equal first and second
marginals, $\sum_{w'} p(w,w') = \sum_{w'} p(w',w)$ for all $w$. This means that $p(w)$ is a stationary distribution of
the Markov transition kernel $p(w'|w)$. The geometric interpretation is that $p(w,a)$ belongs to
the polytope $J := f_\alpha^{-1}(\Xi) \subseteq \Delta_{W \times A}$ defined as the preimage of $\Xi$ by the linear map

$$f_\alpha\colon p(w,a) \mapsto \sum_a p(w,a)\alpha(w'|w,a). \tag{8}$$

We discuss the details of this constraint in Section B.

Summarizing, the objective function $\mathcal{R}\colon \pi \mapsto \sum_w p^\pi(w) \sum_a p^\pi(a|w) R(w,a)$ is the restriction of
the linear function $p(w,a) \mapsto \sum_{w,a} p(w,a) R(w,a)$ to a feasible domain of the form $F \cap J \subseteq
\Delta_{W \times A}$, where $F$ is the set of joint distributions with conditionals from a convex polytope $G$, and $J$
is a convex polytope. We illustrate these notions in the next example.

Example 4. Consider the system illustrated at the top of Figure 2. There are two world states
W = {1, 2}, two sensor states S = {1, 2}, and two possible actions A = {1, 2}. The sensor and
transition probabilities are given by

$$\beta = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}, \quad
\alpha(\cdot|w=1,\cdot) = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\alpha(\cdot|w=2,\cdot) = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{bmatrix}.$$

In the following we discuss the feasible set of joint distributions. The policy polytope $\Delta_{S,A}$ is a
square. The set of realizable conditional distributions of actions given world states is the line

$$G = f_\beta(\Delta_{S,A}) = \operatorname{conv}\left\{ \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}, \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix} \right\}$$

inside of the square $\Delta_{W,A}$. The set $F$ of joint distributions with conditionals from $G$ is a twisted
surface. This set has one copy of $G$ for every world state distribution $p(w)$. See the lower left of
Figure 2. The set $J$ of joint distributions over world states and actions that satisfy the stationarity
constraint (7) is the subset of $\Delta_{W \times A}$ that $f_\alpha$ maps to the polytope $\Xi$ shown in the lower right of
Figure 2. This is the triangle

$$J = f_\alpha^{-1}(\Xi) = \operatorname{conv}\left\{ \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \begin{bmatrix} 0 & 1/3 \\ 0 & 2/3 \end{bmatrix}, \begin{bmatrix} 0 & 1/3 \\ 2/3 & 0 \end{bmatrix} \right\}.$$

As we will show in Lemma 6, the extreme points of $J$ can always be written in terms of extreme
points of $\Delta_{W,A}$; in the present example, in terms of
$\begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}$ (or $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$),
$\begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}$, $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$. The set $F \cap J$
is a curve. This is the feasible domain of the expected reward $\mathcal{R}$, viewed as a function of joint
distributions over world states and actions.

Figure 2: Illustration of Example 4. The world state transitions are shown in the upper part. They
are deterministic from w = 1 and random from w = 2. The lower left shows, inside
of $\Delta_{W \times A}$, the set $F$ defined by the representability constraint (5) and the polytope
$J = f_\alpha^{-1}(\Xi)$ defined by the stationarity constraint (7). The lower right shows, inside
of $\Delta_{W \times W}$, the polytopes $f_\alpha(\Delta_{W \times A})$ and $\Xi$.
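
The geometry of Example 4 can be checked numerically. The sketch below verifies that the three listed vertices of $J$ satisfy the stationarity constraint (7) and traces the curve $F \cap J$. Since $\beta$ is uniform, every realizable conditional has the form $p(a|w) = (t, 1-t)$ for both world states; solving the stationarity condition for the induced chain then gives $p(w=1) = 1/(3-2t)$. This parametrization and all function names are our own illustrative choices.

```python
import numpy as np

# Transition kernel of Example 4, stored as alpha[w, a, w'] (0-based indices).
alpha = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.5, 0.5], [0.5, 0.5]]])

def f_alpha(p_joint):
    """Equation (8): p(w, w') = sum_a p(w, a) alpha(w'|w, a)."""
    return np.einsum("wa,wav->wv", p_joint, alpha)

def in_J(p_joint):
    """Stationarity constraint (7): both marginals of f_alpha(p) coincide."""
    q = f_alpha(p_joint)
    return bool(np.allclose(q.sum(axis=1), q.sum(axis=0)))

# The three vertices of the triangle J, as joint distributions p[w, a].
vertices = [np.array([[1.0, 0.0], [0.0, 0.0]]),
            np.array([[0.0, 1 / 3], [0.0, 2 / 3]]),
            np.array([[0.0, 1 / 3], [2 / 3, 0.0]])]

def curve_point(t):
    """Point of F ∩ J whose conditional is p(a|w) = (t, 1 - t) for both w."""
    p1 = 1.0 / (3.0 - 2.0 * t)                  # stationary probability of w = 1
    p_w = np.array([p1, 1.0 - p1])
    return p_w[:, None] * np.array([t, 1.0 - t])[None, :]

print([in_J(v) for v in vertices])
print(all(in_J(curve_point(t)) for t in np.linspace(0.0, 1.0, 11)))
```

The curve starts at the second vertex (t = 0) and ends at the first vertex (t = 1). The third vertex has different conditionals on the two world states and therefore lies in $J$ but not in $F$.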


4. Determinism of optimal stationary policies 

In this section we discuss the minimal stochasticity of optimal stationary policies. In order to
illustrate our geometric approach we first consider MDPs and then the more general case of POMDPs.

Theorem 5 (MDPs). Consider an MDP $(W, A, \alpha, R)$. Then there is a deterministic optimal
stationary policy.

Proof of Theorem 5. The objective function $\mathcal{R}$ defined in Equation (3) can be regarded as the
restriction of a linear function over $\Delta_{W \times A}$ to the feasible set $J$ defined in Equation (7). Since $J$ is a
convex polytope, the objective function is maximized at one of its extreme points. By Lemma 6, all
extreme points of $J$ can be realized by extreme points of $\Delta_{W,A}$, that is, deterministic policies. □

Lemma 6. Each extreme point of $J$ can be written as $p(w,a) = p(w)p(a|w)$, where $p(w) \in \Delta_W$
and $p(a|w)$ is an extreme point of $\Delta_{W,A}$.

Proof of Lemma 6. We can view the map $f_\alpha$ from Equation (8) as taking pairs $(p(w), p(a|w))$ to
pairs $(p(w), p(w'|w))$. Here the marginal distribution is mapped by the identity function $\Delta_W \to
\Delta_W$; $p(w) \mapsto p(w)$ and the conditional distribution by

$$f_\alpha\colon \Delta_{W,A} \to \Delta_{W,W}; \quad p(a|w) \mapsto \sum_a p(a|w)\alpha(w'|w,a) = p(w'|w).$$

Consider some $W' \subseteq W$ for which $J$ contains a distribution $q$ whose marginal has support
$W'$. For each $w \in W'$ let $A'_w = \{a \in A\colon \operatorname{supp}(\alpha(\cdot|w,a)) \subseteq W'\}$ denote the set of actions with
transitions that stay in $W'$. With a slight abuse of notation let us write $\Delta_{W',A'} := \times_{w \in W'} \Delta_{A'_w} \subseteq
\Delta_{W',A}$ and $\Delta_{W' \times A'} := \Delta_{W'} * \Delta_{W',A'} = \{p(w,a)\colon p(w) \in \Delta_{W'},\ p(a|w) \in \Delta_{W',A'}\} \subseteq \Delta_{W' \times A}$
for the corresponding sets of conditional and joint probability distributions. Note that out of $\Delta_{W' \times A}$
only points from $\Delta_{W' \times A'}$ are mapped to points in $\Delta_{W' \times W'}$ and hence $J \cap \Delta_{W' \times A} \subseteq \Delta_{W' \times A'}$. The
set $f_\alpha(\Delta_{W' \times A'})$ consists of all joint distributions $p(w,w') = p(w)p(w'|w)$ with $p(w) \in \Delta_{W'}$ and
$p(w'|w) \in f_\alpha(\Delta_{W',A'}) \subseteq \Delta_{W',W'}$. Now, for each conditional $p(w'|w) \in \Delta_{W',W'}$ there is at least
one marginal $p(w) \in \Delta_{W'}$ such that the joint $p(w,w') \in \Delta_{W' \times W'}$ is an element of $\Xi$. Hence

$$\dim(f_\alpha(\Delta_{W' \times A'}) \cap \Xi) \ge \dim(f_\alpha(\Delta_{W',A'})).$$

The set $J \cap \Delta_{W' \times A}$ is the union of the fibers of all points in $f_\alpha(\Delta_{W' \times A'}) \cap \Xi$. Hence

$$\begin{aligned}
\dim(J \cap \Delta_{W' \times A}) &\ge \dim(f_\alpha(\Delta_{W' \times A'}) \cap \Xi) + \big(\dim(\Delta_{W',A'}) - \dim(f_\alpha(\Delta_{W',A'}))\big) \\
&\ge \dim(f_\alpha(\Delta_{W',A'})) + \dim(\Delta_{W',A'}) - \dim(f_\alpha(\Delta_{W',A'})) \\
&= \dim(\Delta_{W',A'}).
\end{aligned}$$

Let us now consider some extreme point $q$ of $J$. Suppose that the marginal of $q$ has support
$W'$. By the previous discussion, we know that $q$ is an extreme point of the polytope $J \cap \Delta_{W' \times A'}$.
Furthermore, $J \cap \Delta_{W' \times A'}$ is the $d$-dimensional intersection of an affine space and $\Delta_{W' \times A'}$, where
$d \ge \dim(\Delta_{W',A'})$. This implies that $q$ lies at the intersection of $d$ facets of $\Delta_{W' \times A'}$. In turn
$|\operatorname{supp}(q(w,\cdot))| = 1$ for all $w \in W'$. This shows that $q(w,a) = p(w)p(a|w)$, where $p(w) \in \Delta_{W'}$
and $p(a|w)$ is an extreme point of $\Delta_{W',A'}$. We can extend this conditional arbitrarily on $w \in W \setminus W'$
to obtain a conditional that is an extreme point of $\Delta_{W,A}$. □
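
Theorem 5 can be illustrated numerically on a small MDP. The sketch below uses the transition kernel of Example 4 with fully observed world states and a hypothetical reward table (the reward values and function names are assumptions made for illustration). It enumerates all deterministic stationary policies and compares the best of them against randomly sampled stochastic policies.

```python
import itertools
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of a row-stochastic matrix P (assumed unique)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

def average_reward_mdp(pi, alpha, R):
    """Average reward of a stationary policy pi[w, a] in an MDP."""
    P = np.einsum("wa,wav->wv", pi, alpha)
    p_w = stationary_distribution(P)
    return float(np.sum(p_w[:, None] * pi * R))

alpha = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.5, 0.5], [0.5, 0.5]]])   # alpha[w, a, w'] from Example 4
R = np.array([[1.0, 0.0], [0.0, 2.0]])         # hypothetical reward R[w, a]
n_w, n_a = R.shape

# Evaluate every deterministic policy f: W -> A.
det_values = {}
for f in itertools.product(range(n_a), repeat=n_w):
    pi = np.eye(n_a)[list(f)]                  # one-hot rows
    det_values[f] = average_reward_mdp(pi, alpha, R)
best_det = max(det_values.values())

# Compare against randomly drawn stochastic policies.
rng = np.random.default_rng(0)
random_values = [
    average_reward_mdp(rng.dirichlet(np.ones(n_a), size=n_w), alpha, R)
    for _ in range(200)
]
print(best_det, max(random_values))
```

In line with the theorem, none of the sampled stochastic policies exceeds the best deterministic value. Sampling is of course only a sanity check and not a substitute for the proof.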

Now we discuss the minimal stochasticity of optimal stationary policies for POMDPs. A policy
$\pi \in \Delta_{S,A}$ is called m-stochastic if it is contained in an $m$-dimensional face of $\Delta_{S,A}$. This means
that at most $|S| + m$ entries $\pi(a|s)$ are non-zero and, in particular, that $\pi$ is a convex combination
of at most $m + 1$ deterministic policies. For instance, a deterministic policy is 0-stochastic and
has exactly $|S|$ non-zero entries. The following result holds both in the average reward and in the
discounted reward settings.
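
The definition can be made concrete with a short sketch. The function `stochasticity` returns the smallest $m$ for which a given policy is m-stochastic, namely the number of non-zero entries minus $|S|$. The function `decompose` is one possible greedy way (our own illustrative choice, not taken from the text) to write the policy as a convex combination of at most $m + 1$ deterministic policies: each step removes at least one non-zero entry, and the last step removes one entry in every row.

```python
import numpy as np

def stochasticity(pi, tol=1e-12):
    """Smallest m such that pi[s, a] lies in an m-dimensional face."""
    pi = np.asarray(pi, dtype=float)
    return int((pi > tol).sum()) - pi.shape[0]

def decompose(pi, tol=1e-12):
    """Greedy decomposition into weighted deterministic policies (weight, f)."""
    res = np.array(pi, dtype=float)
    rows = np.arange(res.shape[0])
    parts = []
    while res.sum(axis=1).min() > tol:
        f = res.argmax(axis=1)          # one supported action per sensor state
        lam = res[rows, f].min()        # largest weight that can be removed
        parts.append((float(lam), f.copy()))
        res[rows, f] -= lam
        res[res < tol] = 0.0            # clean up rounding noise
    return parts

pi = np.array([[0.5, 0.5, 0.0],
               [1.0, 0.0, 0.0]])
parts = decompose(pi)
print(stochasticity(pi), len(parts))
```

For this example the policy has three non-zero entries and two sensor states, so it is 1-stochastic, and the decomposition uses two deterministic policies.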

Theorem 7 (POMDPs). Consider a POMDP $(W, S, A, \alpha, \beta, R)$. Let $U = \{s \in S\colon |\operatorname{supp}(\beta(s|\cdot))| >
1\}$. Then there is a $|U|(|A| - 1)$-stochastic optimal stationary policy. Furthermore, for any $W, S, A$
there are $\alpha, \beta, R$ such that every optimal stationary policy is at least $|U|(|A| - 1)$-stochastic.


Proof of Theorem 7. Here we prove the statement for the average reward setting using the geometric
descriptions from Section 3. We cover the discounted setting separately, using value functions and
a policy improvement argument.

Consider the sets $G = f_\beta(\Delta_{S,A}) \subseteq \Delta_{W,A}$ and $F = \Delta_W * G \subseteq \Delta_{W \times A}$ from Equation (5).
We can write $G$ as a union of Cartesian products of convex sets, as $G = \bigcup_{\theta \in \Theta} G_\theta$, with $\dim(G) -
\dim(G_\theta) \le \dim(\Theta) = |U|(|A| - 1)$. In turn, we can write $F =
\bigcup_{\theta \in \Theta} F_\theta$, where each $F_\theta = \Delta_W * G_\theta$ is a convex set of dimension $\dim(F_\theta) = \dim(\Delta_W) +
\dim(G_\theta)$. See the discussion of the representability constraint in Section A for details.

The objective function $\mathcal{R}$ is linear over each polytope $F_\theta \cap J$ and is maximized at an extreme
point of one of these polytopes. If $F_\theta \cap J \ne \emptyset$, then each extreme point of $F_\theta \cap J$ can be written as
$p(w,a) = p(w)p(a|w)$, where $p(a|w)$ is an extreme point of $G_\theta$. To see this, note that the arguments
of Lemma 6 still hold when we replace $J$ by $F_\theta \cap J$ and $\Delta_{W,A}$ by $G_\theta$. Each extreme point of $G_\theta$
lies at a face of $G$ of dimension at most $|U|(|A| - 1)$. Now, since $f_\beta$
is a linear map, the points in the $m$-dimensional faces of $G$ have preimages by $f_\beta$ in $m$-dimensional
faces of $\Delta_{S,A}$. Thus, there is a maximizer of $\mathcal{R}$ that is contained in a $|U|(|A| - 1)$-dimensional face of $\Delta_{S,A}$.

The second statement, regarding the optimality of the stochasticity bound, follows from a separate
proposition, which computes the optimal stationary policies of a class of POMDPs analytically. □


Remark 8.

• Our Theorem 7 also has an interpretation for non-ergodic systems: Among all pairs $(\pi, p^\pi(w))$
of stationary policies and associated stationary world state distributions, the highest value of
$\sum_w p^\pi(w) \sum_a p^\pi(a|w) R(w,a)$ is attained by a pair where the policy $\pi$ is $|U|(|A| - 1)$-stochastic.
However, this optimal stationary average reward is only equal to $\mathcal{R}_\mu(\pi)$ for start
distributions $\mu$ that converge to $p^\pi(w)$.

• For MDPs the set $U$ is empty and the statement of Theorem 7 recovers Theorem 5.

• In a reinforcement learning setting the agent does not know anything about the world state
transitions $\alpha$ nor the observation model $\beta$ a priori, aside from the sets $S$ and $A$. In particular,
he does not know the set $U$ (nor its cardinality). Nonetheless, he can build a hypothesis about
$U$ on the basis of observed sensor states, actions, and rewards. This can be done using a
suitable variant of the Baum-Welch algorithm or inexpensive heuristics, without estimating
the full kernels $\alpha$ and $\beta$.
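
When the observation kernel is known, the set $U$ and the stochasticity bound of Theorem 7 are straightforward to compute. The sketch below does this for the sensor model of Example 3, where world states 1 and 2 are both sensed as sensor state 1. It assumes direct access to $\beta$ (stored as `beta[w, s]`), which, as noted in the remark above, the learning agent does not have; the function names are illustrative.

```python
import numpy as np

def ambiguous_sensor_states(beta):
    """U = {s : |supp(beta(s|.))| > 1} for beta[w, s] = beta(s|w)."""
    support_sizes = (beta > 0).sum(axis=0)   # number of world states emitting s
    return np.flatnonzero(support_sizes > 1)

def stochasticity_bound(beta, n_actions):
    """The bound |U| (|A| - 1) from Theorem 7."""
    return len(ambiguous_sensor_states(beta)) * (n_actions - 1)

# Example 3: W = {0, 1, 2, 3}, three sensor states (columns), |A| = 3.
beta = np.array([[1.0, 0.0, 0.0],    # world state 0 -> sensor state 0
                 [0.0, 1.0, 0.0],    # world state 1 -> sensor state 1
                 [0.0, 1.0, 0.0],    # world state 2 -> sensor state 1
                 [0.0, 0.0, 1.0]])   # world state 3 -> sensor state 3 (column 2)
n_actions = 3
print(ambiguous_sensor_states(beta), stochasticity_bound(beta, n_actions))
print(beta.shape[1] * (n_actions - 1))   # dimension of the full policy polytope
```

Here only one sensor state is ambiguous, so the bound is 2, compared with dimension 6 for the full policy polytope.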


5. Application to defining low dimensional policy models

By Theorem 7, there always exists an optimal stationary policy in a $|U|(|A| - 1)$-dimensional face
of the policy polytope $\Delta_{S,A}$. Instead of optimizing over the entire set $\Delta_{S,A}$, we can optimize over
a lower dimensional subset that contains the $|U|(|A| - 1)$-dimensional faces. In the following we
discuss various ways of defining a differentiable policy model with this property.

We denote the set of policies in $m$-dimensional faces of the polytope $\Delta_{S,A}$ by

$$\mathcal{C}_m := \{\pi \in \Delta_{S,A}\colon |\operatorname{supp}(\pi)| \le |S| + m\}.$$
Note that each policy in $\mathcal{C}_m$ can be written as the convex combination of $m+1$ or fewer deterministic
policies. For example, $\mathcal{C}_0 = \{\pi^f(a|s) = \delta_{f(s)}(a)\colon f \in A^S\}$ is the set of deterministic policies, and
$\mathcal{C}_{|S|(|A|-1)} = \Delta_{S,A}$ is the entire set of stationary policies.

Conditional exponential families. An exponential policy family is a set of policies of the form

$$\pi_\theta(a|s) = \frac{\exp(\theta^\top F(s,a))}{\sum_{a'} \exp(\theta^\top F(s,a'))},$$

where $F\colon S \times A \to \mathbb{R}^d$ is a vector of sufficient statistics and $\theta \in \mathbb{R}^d$ is a vector of parameters. We
can choose $F$ suitably, such that the closure of the exponential family contains $\mathcal{C}_m$.

The k-interaction model is defined by the sufficient statistics

$$F_\lambda(x) = \prod_{i \in \lambda} (-1)^{x_i}, \quad x \in \{0,1\}^n,\ \lambda \subseteq \{1,\dots,n\},\ 1 \le |\lambda| \le k.$$

Here we can identify each pair $(s,a) \in S \times A$ with a length-$n$ binary vector $x \in \{0,1\}^n$, $n =
\lceil \log_2(|S||A|) \rceil$. Since we do not need to model the marginal distribution over $S$, we can remove all
$\lambda$ for which $F_\lambda(s,\cdot)$ is constant for all $s$. The k-interaction model is $(2^k - 1)$-neighborly (Kahle
2010), meaning that, for $2^k - 1 \ge |S| + m$, it contains $\mathcal{C}_m$ in its closure. This results in a policy
model of dimension at most $\sum_{i=1}^{\lceil \log_2(|S|+m+1) \rceil} \binom{\lceil \log_2(|S||A|) \rceil}{i}$. Note that this is only an upper bound,
both on $k$ and the dimension, and usually a smaller model will be sufficient.

An alternative exponential family is defined by taking $F(s,a)$, $(s,a) \in S \times A$, equal to the
vertices of a cyclic polytope. The cyclic polytope $C(N,d)$ is the convex hull of $\{x(t_1), \dots, x(t_N)\}$,
where $x(t) = [t, t^2, \dots, t^d]^\top$, $t_1 < t_2 < \cdots < t_N$, $N > d \ge 2$. This results in a $\lfloor d/2 \rfloor$-neighborly
model. Using this approach yields a policy model of dimension $2(|S| + m)$.

Mixtures of deterministic policies. We can consider policy models of the form

$$\pi_\theta(a|s) = \sum_{f \in A^S} \pi^f(a|s)\, p_\theta(f),$$

where $\pi^f(a|s) = \delta_{f(s)}(a)$ is the deterministic policy defined by the function $f\colon S \to A$ and $p_\theta(f)$
is a model of probability distributions over the set of all such functions. Choosing this as an $(m+1)$-neighborly
exponential family yields a policy model which contains $\mathcal{C}_m$ and, in fact, all mixtures
of $m + 1$ deterministic policies. This kind of model was proposed in Ay et al. (2013).

Identifying each $f \in A^S$ with a length-$n$ binary vector, $n \ge \lceil \log_2(|A|^{|S|}) \rceil$, and using a
k-interaction model with $2^k - 1 = m + 1$ yields a model of dimension $\sum_{i=1}^{\lceil \log_2(m+2) \rceil} \binom{\lceil \log_2(|A|^{|S|}) \rceil}{i}$.
Alternatively, we can use a cyclic exponential family for $p_\theta$, which yields a policy model of
dimension $2(m + 1)$. If we are only interested in modeling the deterministic policies, $m = 0$, then
this model has dimension two.
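
To make the k-interaction construction concrete, the following sketch builds the sufficient statistics $F_\lambda(x) = \prod_{i \in \lambda} (-1)^{x_i}$ and evaluates the resulting conditional exponential family policy. The encoding of a pair $(s,a)$ as the binary expansion of the index `s * n_actions + a` is an assumption made here for illustration, and the sketch does not remove the statistics that are constant in $a$ (they cancel in the normalization and only add redundant parameters).

```python
import itertools
import numpy as np

def k_interaction_features(n, k):
    """Matrix with rows F_lambda, 1 <= |lambda| <= k, and columns x in {0,1}^n."""
    X = np.array(list(itertools.product([0, 1], repeat=n)))   # (2^n, n)
    signs = 1 - 2 * X                                         # (-1)^{x_i}
    subsets = [lam for r in range(1, k + 1)
               for lam in itertools.combinations(range(n), r)]
    F = np.array([signs[:, list(lam)].prod(axis=1) for lam in subsets])
    return F, subsets

def policy(theta, F, n_states, n_actions):
    """pi_theta(a|s) proportional to exp(theta^T F(s, a))."""
    logits = (theta @ F)[: n_states * n_actions].reshape(n_states, n_actions)
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

n_states, n_actions = 10, 4                                   # sizes of the maze example
n = int(np.ceil(np.log2(n_states * n_actions)))               # n = 6 binary variables
F, subsets = k_interaction_features(n, k=2)
theta = np.random.default_rng(0).normal(size=F.shape[0])
pi = policy(theta, F, n_states, n_actions)
print(F.shape, pi.shape)
```

With $n = 6$ and $k = 2$ there are $6 + 15 = 21$ statistics over 64 binary vectors, of which only the first 40 columns (one per pair $(s,a)$) enter the policy.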


Conditional restricted Boltzmann machines. A conditional restricted Boltzmann machine (CRBM)
is a model of policies of the form

$$\pi_\theta(y|x) = \frac{1}{Z(x)} \sum_{z \in \{0,1\}^{n_{\text{hidden}}}} \exp\left(z^\top V x + z^\top W y + b^\top y + c^\top z\right),$$

with parameter $\theta = \{W, V, b, c\}$, $W \in \mathbb{R}^{n_{\text{hidden}} \times n_{\text{out}}}$, $V \in \mathbb{R}^{n_{\text{hidden}} \times n_{\text{in}}}$, $b \in \mathbb{R}^{n_{\text{out}}}$, $c \in \mathbb{R}^{n_{\text{hidden}}}$. Here
we identify each $s \in S$ with a vector $x \in \{0,1\}^{n_{\text{in}}}$, $n_{\text{in}} = \lceil \log_2 |S| \rceil$, and each $a \in A$ with a vector
$y \in \{0,1\}^{n_{\text{out}}}$, $n_{\text{out}} = \lceil \log_2 |A| \rceil$. There are theoretical results on CRBMs (Montúfar et al. 2015)
showing that they can represent every policy from $\mathcal{C}_m$ whenever $n_{\text{hidden}} \ge |S| + m - 1$. A sufficient
number of parameters is thus $(|S| + m - 1)(\lceil \log_2 |S| \rceil + \lceil \log_2 |A| \rceil) + \lceil \log_2 |A| \rceil$.
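
For small models the CRBM policy can be evaluated exactly. The sketch below computes $\pi_\theta(y|x)$ in two ways: by brute-force summation over all hidden states, and by using the fact that the sum over binary hidden units factorizes, $\sum_z \exp(z^\top u) = \prod_j (1 + e^{u_j})$ with $u = Vx + Wy + c$. The layer sizes, random parameters, and function names are illustrative assumptions.

```python
import itertools
import numpy as np

def crbm_policy_bruteforce(x, W, V, b, c):
    """pi(y|x) by explicit summation over all hidden states z."""
    n_hidden, n_out = W.shape
    ys = np.array(list(itertools.product([0, 1], repeat=n_out)))
    zs = np.array(list(itertools.product([0, 1], repeat=n_hidden)))
    weights = np.array([
        sum(np.exp(z @ V @ x + z @ W @ y + b @ y + c @ z) for z in zs)
        for y in ys
    ])
    return weights / weights.sum()

def crbm_policy(x, W, V, b, c):
    """Same distribution, using the factorization over hidden units."""
    n_out = W.shape[1]
    ys = np.array(list(itertools.product([0, 1], repeat=n_out)))
    pre = V @ x + c                                   # input to the hidden units
    # log of exp(b^T y) * prod_j (1 + exp(pre_j + (W y)_j)) for every y
    log_w = ys @ b + np.logaddexp(0.0, pre[None, :] + ys @ W.T).sum(axis=1)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

rng = np.random.default_rng(0)
n_in, n_out, n_hidden = 2, 2, 3                       # hypothetical sizes
W = rng.normal(size=(n_hidden, n_out))
V = rng.normal(size=(n_hidden, n_in))
b = rng.normal(size=n_out)
c = rng.normal(size=n_hidden)
x = np.array([1, 0])                                  # binary code of a sensor state
print(crbm_policy(x, W, V, b, c))
```

The factorized form costs a number of operations linear in the number of hidden units per output vector, whereas the brute-force sum grows exponentially in it.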

Each of these models has advantages and disadvantages. The CRBMs can be sampled very 
efficiently using a Gibbs sampling approach. The mixture models can be very low dimensional, but 
may have an intricate geometry. The k-interaction models are smooth manifolds.


6. Experiments 

We run computer experiments to explore the practical utility of our theoretical results. We consider
the maze from Example 2. In this example, the set $U$ of sensor states $s$ with $|\operatorname{supp}(\beta(s|\cdot))| > 1$ has
cardinality two. By Theorem 7, there is a $|U|(|A| - 1) = 6$ stochastic optimal stationary policy. As
a family of policy models we choose the k-interaction models from Section 5. The number of binary
variables is $n = \lceil \log_2(|S||A|) \rceil = 6$. This results in a sufficient statistics matrix with 64 columns,
out of which we keep only the first 40, one for each pair $(s,a)$. For $k = 1, \dots, 5$, the resulting
model dimension is 2, 11, 23, 29, 30. The policy polytope has dimension $|S|(|A| - 1) = 30$.

We consider the reinforcement learning problem, where the agent does not know $W, \alpha, \beta, R$ in
advance. We use stochastic gradient with an implementation of the GPOMDP algorithm (Baxter
and Bartlett 2001) for estimating the gradient. We fix a constant learning rate of 1, a time window
of $T = 1, \dots, 100$ for each Markov chain gradient and average reward estimation, and perform
10 000 gradient iterations on a random parameter initialization.
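
The training loop described above can be sketched as follows. This is a minimal illustration of a GPOMDP-style gradient estimate (an eligibility trace of score functions, averaged against the observed rewards, following Baxter and Bartlett 2001), not the implementation used for the reported experiments. It assumes a tabular softmax policy instead of a k-interaction model, a small hypothetical two-state system instead of the maze, and a trace decay parameter `trace_decay` whose value is not stated in the text.

```python
import numpy as np

def softmax_policy(theta):
    """Tabular softmax policy pi[s, a] with parameters theta[s, a]."""
    z = theta - theta.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def gpomdp_gradient(theta, alpha, beta, R, T=100, trace_decay=0.9, rng=None):
    """One gradient and average reward estimate from a length-T trajectory."""
    rng = np.random.default_rng() if rng is None else rng
    n_w, n_s = beta.shape
    n_a = theta.shape[1]
    w = rng.integers(n_w)                     # random start state
    z = np.zeros_like(theta)                  # eligibility trace
    delta = np.zeros_like(theta)              # running gradient estimate
    total = 0.0
    pi = softmax_policy(theta)
    for t in range(T):
        s = rng.choice(n_s, p=beta[w])        # the agent only sees s
        a = rng.choice(n_a, p=pi[s])
        r = R[w, a]
        grad_log = np.zeros_like(theta)       # gradient of log pi(a|s)
        grad_log[s] = -pi[s]
        grad_log[s, a] += 1.0
        z = trace_decay * z + grad_log
        delta += (r * z - delta) / (t + 1)
        total += r
        w = rng.choice(n_w, p=alpha[w, a])
    return delta, total / T

# Hypothetical system (2 world states, 2 sensor states, 2 actions).
beta = np.full((2, 2), 0.5)
alpha = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 2))               # random initialization
for _ in range(50):                           # a few gradient iterations
    grad, avg_reward = gpomdp_gradient(theta, alpha, beta, R, rng=rng)
    theta = theta + 1.0 * grad                # constant learning rate of 1
print(avg_reward)
```

The gradient estimate is noisy for short trajectories, which is consistent with the observation in this section that models with fewer parameters are less sensitive to the noise in the stochastic gradient.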

The results are shown in Figure 3. The first column shows the learning curves for k = 1,..., 5,
for the first 2 500 gradient iterations. Shown is the average of the learning curves over 5
repetitions of the experiment. The individual curves are all very similar for each fixed k.
The value shown is the estimated average reward, with a running average shown in bold, for better
visibility. The second column shows the final policy. The third column gives a detail of the learning
curves and shows the reward averaged over the entire learning process.

The independence model, with k = 1, performs very poorly, as it learns a fixed distribution of
actions for all sensor states. The next model, with k = 2, performs better, but still has very limited
expressive power. All the other models have sufficient complexity to learn a (nearly) optimal policy,
in principle. However, out of these, the least complex one, with k = 3, performs best. This indicates
that the least complex model which is able to learn an optimal policy does learn faster. This model
has fewer parameters to explore and is less sensitive to the noise in the stochastic gradient.


7. Conclusions 

Policy optimization for partially observable Markov decision processes is a challenging problem.
Scaling is a serious difficulty in most algorithms and theoretical results are scarce on approximate
methods. This paper develops a geometric view on the problem of finding optimal stationary
policies. The maximization of the long term expected reward per time step can be regarded as a
constrained linear optimization problem with two constraints. The first one is a quadratic constraint
that arises from the partial observability of the world state. The second is a linear constraint that
arises from the stationarity of the world state distribution. We can decompose the feasible domain
into convex pieces, on which the optimization problem is linear. This analysis sheds light on
the complexity of stationary policy optimization for POMDPs and reveals avenues for designing
learning algorithms.

Figure 3: Experimental results on the maze example. The left column shows the average reward
learning curves for $k$-interaction models, with $k = 1, \ldots, 5$ from top to bottom. The
second column shows the final policies as matrices of sensor-state action probabilities
(white is 1). The right column compares all learning curves and shows the overall average
reward for all models. The model with $k = 3$ performs best.

We show that every POMDP has an optimal stationary policy of limited stochasticity. The
necessary level of stochasticity is bounded above by the number of sensor states that are ambiguous
about the underlying world state, independently of the specific reward function. This allows us to
define low-dimensional models which are guaranteed to contain optimal stationary policies. Our
experiments show that the proposed dimensionality reduction does indeed allow learning better
policies faster. Having fewer parameters, these models are less expensive to train and less sensitive to
noise, while at the same time being able to learn the best possible stationary policies.

Appendix A. The representability constraint

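Here we investigate the set of representable policies in the underlying MDP; that is, the set of kernels
of the form $p^\pi(a|w) = \sum_s \beta(s|w)\pi(a|s)$. This set is the image $G = f_\beta(\Delta_{S,A})$ of the linear map

\[ f_\beta\colon \Delta_{S,A} \to \Delta_{W,A}; \quad \pi(a|s) \mapsto \sum_s \beta(s|w)\pi(a|s). \]

We are interested in the properties of this set, depending on the observation kernel $\beta$.

Consider first the special case of a deterministic kernel $\beta^b$, defined by $\beta^b(s|w) = \delta_{b(w)}(s)$, for
some function $b\colon W \to S$. Then

\[ f_{\beta^b}(\Delta_{S,A}) = \times_{s \in S} \operatorname{sym} \Delta_{b^{-1}(s),A}, \]

where $b^{-1}(s) \subseteq W$ is the set of world states that $b$ maps to $s$ and, for a set of inputs $V$,
$\operatorname{sym} \Delta_{V,A} := \{ g \in \Delta_{V,A} : g(\cdot|w) = p \text{ for all } w \in V, \text{ for some } p \in \Delta_A \}$
is the set of elements of $\Delta_{V,A}$ that consist of one repeated probability distribution. This
set can be written as a union of Cartesian products,

\[ f_{\beta^b}(\Delta_{S,A}) = \bigcup_{\theta \in \Delta_{U,A}} \Big( \times_{s \in U} \times_{w \in b^{-1}(s)} \{\theta(\cdot|s)\} \Big) \times \Big( \times_{s \in S \setminus U} \Delta_{b^{-1}(s),A} \Big), \]

where $U := \{ s \in S : |b^{-1}(s)| > 1 \}$ is the set of sensor states that can result from several world
states. For instance, when $\beta$ is the identity function we have $G = \Delta_{W,A} = \times_{w \in W} \Delta_A$.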

Proposition 9. Consider a measurement $\beta \in \Delta_{W,S}$ and the map $f_\beta\colon \Delta_{S,A} \to \Delta_{W,A}$; $\pi(a|s) \mapsto
\sum_s \beta(s|w)\pi(a|s)$. Let $U = \{ s \in S : |\operatorname{supp}(\beta(s|\cdot))| > 1 \}$ be the sensor states that can be obtained
from several world states. The set $G = f_\beta(\Delta_{S,A})$ can be written as $G = \bigcup_{\theta \in \Theta} G_\theta$, where each $G_\theta$
is a Cartesian product of convex sets, $G_\theta = \times_{w \in W} G_{\theta,w}$, $G_{\theta,w} \subseteq \Delta_A$ convex, and each vertex of
$G_\theta$ lies in a face of $G$ of dimension at most $|U|(|A| - 1)$.

Figure 4: Illustration of Example 10. Shown is a decomposition of $G = f_\beta(\Delta_{S,A}) \subseteq \Delta_{W,A}$ into a
collection of Cartesian products of convex sets, $G_\theta$, $\theta \in \Theta$.


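Proof of Proposition 9. We use as index set $\Theta$ the set of policies $\Delta_{U,A}$. We can write

\[ G = \bigcup_{\theta \in \Theta} G_\theta \quad \text{with} \quad G_\theta = \times_{w \in W} \Big( \sum_{s \in U} \beta(s|w)\theta(\cdot|s) + \sum_{s \in S \setminus U} \beta(s|w)\Delta_A \Big). \]

This proves the first part of the claim. For the second part, note that all $G_\theta$ are equal but for addition
of a linear projection of $\theta \in \Delta_{U,A}$. □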


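Example 10. Let $W = \{0, 1, 2\}$, $S = \{0, 1\}$, $A = \{0, 1\}$. Let $\beta$ map $w = 0$ and $w = 1$ to $s = 0$,
and $w = 2$ to $s = 1$, with probability one. Written as a table $(\beta(s|w))_{w,s}$ this is

\[ \beta = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}. \]

The policy polytope $\Delta_{S,A}$ is the square with vertices

\[ \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix}, \quad \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]

The polytope $G = f_\beta(\Delta_{S,A}) \subseteq \Delta_{W,A}$ is the square with vertices

\[ \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \quad \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \]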


and can be written as a union of Cartesian products of convex sets, illustrated in Figure 4,

\[ G = \bigcup_{\theta \in \Delta_A} G_\theta, \quad G_\theta = \{\theta\} \times \{\theta\} \times \Delta_A. \]

As mentioned in the main text, the set $F \subseteq \Delta_{W \times A}$ of joint distributions that are compatible
with the representable conditionals $G = f_\beta(\Delta_{S,A}) \subseteq \Delta_{W,A}$ may not be convex. In the following
we describe large convex subsets of $F$, depending on the properties of $G$. We use the following
definitions.

Definition 11. • Given a set of distributions $\mathcal{P} \subseteq \Delta_W$ and a set of kernels $G \subseteq \Delta_{W,A}$, let

\[ \mathcal{P} * G := \{ q(w, a) = p(w) g(a|w) \in \Delta_{W \times A} : p \in \mathcal{P},\ g \in G \} \]

denote the set of joint distributions over world states and actions, with world state marginals
in $\mathcal{P}$ and conditional distributions in $G$.

• For any $V \subseteq W$ let

\[ \Delta_W(V) := \{ p \in \Delta_W : \operatorname{supp}(p) := \{ w \in W : p(w) > 0 \} \subseteq V \} \]

denote the set of world state distributions with support in $V$.

• Given a subset $V \subseteq W$ and a set of kernels $G \subseteq \Delta_{W,A}$, let

\[ G|_V := \{ h \in \Delta_{V,A} : h(\cdot|w) = g(\cdot|w) \text{ for all } w \in V, \text{ for some } g \in G \} \]

denote the set of restrictions of elements of $G$ to inputs from $V$.

The following proposition states that a set of Markov kernels which is a Cartesian product 
of convex sets, with one factor for each input, corresponds to a convex set of joint probability 
distributions. Furthermore, if the considered input distributions assign zero probability to some of 
the inputs, then the convex factorization property is only needed for the restriction to the positive- 
probability inputs. 

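Proposition 12. Let $V \subseteq W$. Let $\mathcal{P} \subseteq \Delta_W(V)$ be a convex set. Let $G \subseteq \Delta_{W,A}$ satisfy $G|_V =
\times_{w \in V} G_w \subseteq \Delta_{V,A}$, where $G_w \subseteq \Delta_A$ is a convex set for all $w \in V$. Then $\mathcal{P} * G \subseteq \Delta_{W \times A}$ is convex.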

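Proof of Proposition 12. We need to show that, given any two distributions $q'$ and $q''$ in $\mathcal{P} * G$ and
any $\lambda \in [0, 1]$, the convex combination $q = \lambda q' + (1 - \lambda) q''$ lies in $\mathcal{P} * G$. This is the case if and
only if $q(w, a) = p(w) g(a|w)$ for some $p \in \mathcal{P}$ and some $g \in \Delta_{W,A}$ with $g|_V \in G|_V$. We have

\begin{align*}
q(w, a) &= \lambda q'(w, a) + (1 - \lambda) q''(w, a) \\
&= \lambda p'(w) g'(a|w) + (1 - \lambda) p''(w) g''(a|w) \\
&= \big( \lambda p'(w) + (1 - \lambda) p''(w) \big) \left( \frac{\lambda p'(w)}{\lambda p'(w) + (1 - \lambda) p''(w)} g'(a|w) + \frac{(1 - \lambda) p''(w)}{\lambda p'(w) + (1 - \lambda) p''(w)} g''(a|w) \right).
\end{align*}

This shows that $q(w, a) = p(w) g(a|w)$, where $p(w) = \lambda p'(w) + (1 - \lambda) p''(w) \in \mathcal{P}$ and
$g(\cdot|w) = \lambda_w g'(\cdot|w) + (1 - \lambda_w) g''(\cdot|w) \in G_w$, with $\lambda_w = \frac{\lambda p'(w)}{\lambda p'(w) + (1 - \lambda) p''(w)}$, for $w \in V$
(if $p(w) = 0$, any $\lambda_w \in [0, 1]$ can be used). Hence $g|_V \in G|_V$ and $q \in \mathcal{P} * G$. □

The set of Markov kernels $\Delta_{W,A}$ is a Cartesian product of convex sets, $\Delta_{W,A} = \times_{w \in W} \Delta_A$.
The set of joint distributions $\Delta_{W \times A} = \Delta_W * \Delta_{W,A}$ is a simplex, which is a convex set.

A general set $G \subseteq \Delta_{W,A}$ is not necessarily convex, let alone a Cartesian product of convex sets.
However, it can always be written as a union of Cartesian products of convex sets of the form

\[ G = \bigcup_{\theta \in \Theta} G_\theta, \quad G_\theta = \times_{w \in W} G_{\theta,w}, \quad G_{\theta,w} \subseteq \Delta_A \text{ convex}. \]

For instance, one can always use $\Theta = G$, $G_{\theta = g} = \{g\}$, $G_{\theta = g, w} = \{g(\cdot|w)\}$. Proposition 12, together
with this observation, implies that given any $G \subseteq \Delta_{W,A}$ and a convex set $\mathcal{P} \subseteq \Delta_W$, the set of joint
distributions $F = \mathcal{P} * G \subseteq \Delta_{W \times A}$ is a union of convex sets $F_\theta = \mathcal{P} * G_\theta$, $\theta \in \Theta$. The situation is
illustrated in Example 13.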

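Example 13. Consider the settings from Example 10. The set $F = \Delta_W * G \subseteq \Delta_{W \times A}$ is the union
of the following sets:

\[ F_\theta = \Delta_W * G_\theta, \quad \theta \in \Delta_A. \]

Each $F_\theta \subseteq F \subseteq \Delta_{W \times A}$ is a polytope with vertices

\[ \begin{bmatrix} \theta & (1 - \theta) \\ 0 & 0 \\ 0 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 0 \\ \theta & (1 - \theta) \\ 0 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}. \]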
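
The sets in Examples 10 and 13 are small enough to enumerate directly. The following sketch, with illustrative function names that are not part of the paper, applies $f_\beta$ to the deterministic policies to obtain the vertices of $G$, and lists the vertices of $F_\theta$ for a given $\theta$, here encoded by the probability of action $a = 0$.

```python
from itertools import product

# Observation kernel beta[w][s] of Example 10: w = 0, 1 -> s = 0 and w = 2 -> s = 1.
beta = [[1, 0], [1, 0], [0, 1]]

def f_beta(policy):
    """Map a policy pi[s][a] to the kernel p[w][a] = sum_s beta[w][s] * pi[s][a]."""
    return [[sum(beta[w][s] * policy[s][a] for s in range(2)) for a in range(2)]
            for w in range(3)]

# Vertices of the policy polytope are the deterministic policies.
rows = [[1, 0], [0, 1]]
policy_vertices = [[list(r0), list(r1)] for r0, r1 in product(rows, rows)]
G_vertices = [f_beta(pi) for pi in policy_vertices]

def joint(p, g):
    """Joint distribution q[w][a] = p[w] * g[w][a]."""
    return [[p[w] * g[w][a] for a in range(2)] for w in range(3)]

def F_theta_vertices(theta):
    """Vertices of F_theta = Delta_W * G_theta, with theta = probability of a = 0."""
    th = [theta, 1 - theta]
    out = []
    for w in range(3):
        point_mass = [1 if v == w else 0 for v in range(3)]
        # Only g(.|2) is free in G_theta; it ranges over the vertices of Delta_A.
        last_rows = rows if w == 2 else [th]
        for last in last_rows:
            q = joint(point_mass, [th, th, last])
            if q not in out:
                out.append(q)
    return out

for v in G_vertices:
    print(v)
for v in F_theta_vertices(0.25):
    print(v)
```

The enumeration makes the structure visible: the two world states that share a sensor state always receive the same action distribution, while the third is unconstrained.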


Appendix B. The stationarity constraint 

In the objective function, the marginal distribution over world states is the stationary distribution of
the world state transition kernel, and not some arbitrary distribution over world states. The coupling
of transition kernels and marginal distributions can be described in terms of the polytope $\Xi$ of joint
distributions in $\Delta_{W \times W}$ with equal first and second marginals. This is given by

\[ \Xi := \Big\{ p(w, w') \in \Delta_{W \times W} : \sum_{w'} p(w, w') = \sum_{w'} p(w', w) \text{ for all } w \in W \Big\}. \]

The second marginal is the result of applying the conditional as a Markov kernel to the first marginal;
that is, $\sum_w p(w) p(w'|w) = p(w')$. Hence equality of both marginals means that the marginal is a
stationary distribution of the transition $p(w'|w)$.

The polytope $\Xi$ has been studied by Weis (2010) under the name Kirchhoff polytope. The
vertices of $\Xi$ are the joint distributions of the following form. For any non-empty subset $\mathcal{W} \subseteq W$
and a cyclic permutation $\sigma\colon \mathcal{W} \to \mathcal{W}$, there is a vertex defined by

\[ c_{\mathcal{W},\sigma}(w, w') = \frac{1}{|\mathcal{W}|} \begin{cases} 1, & \text{if } \sigma(w) = w' \\ 0, & \text{otherwise.} \end{cases} \]

The dimension is $\dim(\Xi) = |W|(|W| - 1)$. To see this, note that each strictly positive transition
$p(w'|w)$ is trivially a primitive Markov kernel and hence it has a unique stationary limit distribution.
In turn, the set of strictly positive transitions, which has dimension $|W|(|W| - 1)$,
corresponds to the relative interior of $\Xi$.

Figure 5: The polytope $\Xi \subseteq \Delta_{W \times W}$, $W = \{0, 1\}$, discussed in Example 14. Subsets whose first
marginals $p(w = 0)$ take several fixed values are highlighted. The right panel shows the
corresponding sets of conditional distributions in $\Delta_{W,W}$.


Example 14. Let $W = \{0, 1\}$. The non-empty subsets of $W$ are $\mathcal{W} = \{0\}, \{1\}, \{0, 1\}$, and the
cyclic permutations of these subsets are $0 \to 0$, $1 \to 1$, $(0, 1) \to (1, 0)$. The Kirchhoff polytope $\Xi$
is the triangle enclosed by the points

\[ c_{\{0\}, 0 \to 0} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad c_{\{1\}, 1 \to 1} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}, \quad c_{\{0,1\}, (0,1) \to (1,0)} = \begin{bmatrix} 0 & 1/2 \\ 1/2 & 0 \end{bmatrix}. \]

Every strictly positive joint distribution $p(w, w')$ corresponds to a marginal $p(w)$ and a conditional
distribution $p(w'|w)$. Each point in the interior of $\Xi$ corresponds to a point in the interior of $\Delta_{W,W}$.
The situation is illustrated in Figure 5.
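
The vertex description can be turned into a short enumeration. The sketch below is an illustration with hypothetical function names, not code from the paper: it lists one vertex per non-empty subset and cyclic permutation, using exact fractions, and provides a helper to check that the two marginals of each vertex coincide.

```python
from fractions import Fraction
from itertools import combinations, permutations

def kirchhoff_vertices(n):
    """Enumerate the vertices c_{V,sigma} for W = {0, ..., n-1}."""
    vertices = []
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            first, rest = subset[0], subset[1:]
            # Fixing the first element lists each cyclic permutation exactly once.
            for order in permutations(rest):
                cycle = (first,) + order
                c = [[Fraction(0)] * n for _ in range(n)]
                for pos, w in enumerate(cycle):
                    w_next = cycle[(pos + 1) % size]
                    c[w][w_next] = Fraction(1, size)
                vertices.append(c)
    return vertices

def marginals(c):
    """Return the first and second marginals of a joint distribution c[w][w']."""
    n = len(c)
    first = [sum(c[w][v] for v in range(n)) for w in range(n)]
    second = [sum(c[v][w] for v in range(n)) for w in range(n)]
    return first, second

for c in kirchhoff_vertices(2):
    print(c, marginals(c))
```

For $W = \{0, 1\}$ this produces the three points of Example 14, each with equal first and second marginals.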


Appendix C. Determinism of optimal stationary policies for discounted rewards 

Theorem 15. Consider a POMDP $(W, S, A, \alpha, \beta, R)$, a discount factor $\gamma \in (0, 1)$, and a start
distribution $\mu$. Then there is a stationary policy $\pi^* \in \Delta_{S,A}$ that is deterministic on each $s \in S$ with
$|\operatorname{supp}(\beta(s|\cdot))| \leq 1$ and satisfies $\mathcal{R}^\gamma_\mu(\pi^*) \geq \mathcal{R}^\gamma_\mu(\pi)$ for all $\pi \in \Delta_{S,A}$.

We will prove Theorem 15 using a policy improvement argument. The world state value function
of a policy $\pi$ is the unique solution of the Bellman equation

\[ V^\pi(w) = \sum_a p^\pi(a|w) \Big[ R(w, a) + \gamma \sum_{w'} \alpha(w'|w, a) V^\pi(w') \Big]. \]

The action value function is given by

\[ Q^\pi(w, a) = R(w, a) + \gamma \sum_{w'} \alpha(w'|w, a) V^\pi(w'). \]

These definitions make sense both for MDPs and POMDPs. However, while for MDPs there is
a stationary policy that maximizes the value of each world state simultaneously, for POMDPs the
same is not true in general.

Lemma 16 (POMDP policy improvement). Let $\pi, \pi' \in \Delta_{S,A}$ be two policies with $\sum_a p^{\pi'}(a|w) Q^\pi(w, a) \geq
V^\pi(w)$ for all $w$. Then $V^{\pi'}(w) \geq V^\pi(w)$ for all $w$.


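Proof of Lemma 16. The proof follows closely the arguments of the MDP deterministic policy
improvement theorem presented by Sutton and Barto (1998).

\begin{align*}
V^\pi(w) &\leq \sum_a p^{\pi'}(a|w) Q^\pi(w, a) \\
&= \mathbb{E}_{\pi', w_0 = w} \big[ R(w_0, a_0) + \gamma V^\pi(w_1) \big] \\
&\leq \mathbb{E}_{\pi', w_0 = w} \Big[ R(w_0, a_0) + \gamma \sum_a p^{\pi'}(a|w_1) Q^\pi(w_1, a) \Big] \\
&= \mathbb{E}_{\pi', w_0 = w} \big[ R(w_0, a_0) + \gamma R(w_1, a_1) + \gamma^2 V^\pi(w_2) \big] \\
&\;\;\vdots \\
&\leq \lim_{T \to \infty} \mathbb{E}_{\pi', w_0 = w} \Big[ \sum_{t=0}^{T-1} \gamma^t R(w_t, a_t) \Big] \\
&= V^{\pi'}(w).
\end{align*}

□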


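Proof of Theorem 15. Consider any policy $\pi \in \Delta_{S,A}$. Consider some $\tilde{s} \in S$ with $\operatorname{supp}(\beta(\tilde{s}|\cdot)) = \{\tilde{w}\}$
and $\tilde{a} \in \operatorname{argmax}_a Q^\pi(\tilde{w}, a)$. We define an alternative policy by $\pi'(a|s) = \pi(a|s)$, $s \neq \tilde{s}$, and
$\pi'(\tilde{a}|\tilde{s}) = 1$. This policy is deterministic on $\tilde{s}$. We have

\[ p^{\pi'}(a|w) = \sum_s \beta(s|w) \pi'(a|s) = \sum_{s \neq \tilde{s}} \beta(s|w) \pi(a|s) = p^\pi(a|w), \quad \text{for all } w \neq \tilde{w}, \]

and

\[ p^{\pi'}(a|\tilde{w}) = \sum_s \beta(s|\tilde{w}) \pi'(a|s) = \sum_{s \neq \tilde{s}} \beta(s|\tilde{w}) \pi(a|s) + \beta(\tilde{s}|\tilde{w}) \delta_{a,\tilde{a}}. \]

In turn,

\[ \sum_a p^{\pi'}(a|w) Q^\pi(w, a) = \sum_a p^\pi(a|w) Q^\pi(w, a) = V^\pi(w), \quad \text{for all } w \neq \tilde{w}, \]

and

\begin{align*}
\sum_a p^{\pi'}(a|\tilde{w}) Q^\pi(\tilde{w}, a) &= \sum_a \Big[ \sum_{s \neq \tilde{s}} \beta(s|\tilde{w}) \pi(a|s) + \beta(\tilde{s}|\tilde{w}) \delta_{a,\tilde{a}} \Big] Q^\pi(\tilde{w}, a) \\
&= \sum_a \sum_{s \neq \tilde{s}} \beta(s|\tilde{w}) \pi(a|s) Q^\pi(\tilde{w}, a) + \sum_a \beta(\tilde{s}|\tilde{w}) \delta_{a,\tilde{a}} Q^\pi(\tilde{w}, a) \\
&\geq \sum_a \sum_{s \neq \tilde{s}} \beta(s|\tilde{w}) \pi(a|s) Q^\pi(\tilde{w}, a) + \sum_a \beta(\tilde{s}|\tilde{w}) \pi(a|\tilde{s}) Q^\pi(\tilde{w}, a) \\
&= \sum_a \Big[ \sum_s \beta(s|\tilde{w}) \pi(a|s) \Big] Q^\pi(\tilde{w}, a) \\
&= \sum_a p^\pi(a|\tilde{w}) Q^\pi(\tilde{w}, a) = V^\pi(\tilde{w}).
\end{align*}

This shows that $\sum_a p^{\pi'}(a|w) Q^\pi(w, a) \geq V^\pi(w)$, for all $w$. By Lemma 16, $V^{\pi'}(w) \geq V^\pi(w)$, for
all $w$. Repeating the same arguments, we conclude that any policy $\pi$ can be replaced by a policy $\pi'$
which is deterministic on each $s \in S$ with $|\operatorname{supp} \beta(s|\cdot)| = 1$ and which satisfies $V^{\pi'}(w) \geq V^\pi(w)$
for all $w \in W$. Sensor states with $|\operatorname{supp}(\beta(s|\cdot))| = 0$ are never observed and the corresponding
policy assignment is immaterial. This completes the proof. □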
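
The improvement step of the proof can be checked numerically on a small instance. The sketch below reuses the observation structure of Example 10, where the sensor state $s = 1$ is produced only by the world state $w = 2$. The transition probabilities `alpha`, the rewards `R`, and the discount factor are made-up numbers chosen for illustration only, and policy evaluation is done by simple fixed-point iteration rather than by any method described in the paper.

```python
GAMMA = 0.9                                        # assumed discount factor
beta = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]        # beta[w][s], as in Example 10
alpha = [                                          # alpha[w][a][w'], made-up numbers
    [[0.1, 0.6, 0.3], [0.5, 0.2, 0.3]],
    [[0.3, 0.3, 0.4], [0.2, 0.1, 0.7]],
    [[0.6, 0.2, 0.2], [0.1, 0.8, 0.1]],
]
R = [[1.0, 0.0], [0.0, 2.0], [0.5, 1.5]]           # R[w][a], made-up numbers

def world_kernel(pi):
    """p^pi(a|w) = sum_s beta(s|w) pi(a|s)."""
    return [[sum(beta[w][s] * pi[s][a] for s in range(2)) for a in range(2)]
            for w in range(3)]

def q_values(V):
    """Q(w, a) = R(w, a) + gamma * sum_w' alpha(w'|w, a) V(w')."""
    return [[R[w][a] + GAMMA * sum(alpha[w][a][v] * V[v] for v in range(3))
             for a in range(2)] for w in range(3)]

def evaluate(pi, sweeps=2000):
    """Iterate the Bellman equation of the fixed policy until it has converged."""
    p = world_kernel(pi)
    V = [0.0] * 3
    for _ in range(sweeps):
        Q = q_values(V)
        V = [sum(p[w][a] * Q[w][a] for a in range(2)) for w in range(3)]
    return V

pi = [[0.5, 0.5], [0.5, 0.5]]                      # a stochastic starting policy
V = evaluate(pi)
Q = q_values(V)
best = max(range(2), key=lambda a: Q[2][a])        # w = 2 is the only preimage of s = 1
pi_new = [pi[0], [1.0 if a == best else 0.0 for a in range(2)]]
V_new = evaluate(pi_new)
print(V, V_new)
```

In line with Lemma 16, the modified policy should be at least as good as the original one at every world state, while being deterministic on the unambiguous sensor state.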

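We conclude this section with a few remarks. It is worthwhile to mention the relation

\[ \sum_w p^\pi(w) V^\pi(w) = \frac{\mathcal{R}(\pi)}{1 - \gamma}, \]

which follows from (see Singh et al., 1994, Fact 7)

\begin{align*}
\sum_w p^\pi(w) V^\pi(w) &= \sum_w p^\pi(w) \sum_a p^\pi(a|w) \Big[ R(w, a) + \gamma \sum_{w'} \alpha(w'|w, a) V^\pi(w') \Big] \\
&= \sum_w p^\pi(w) \Big[ \sum_a p^\pi(a|w) R(w, a) + \gamma \sum_{w'} p^\pi(w'|w) V^\pi(w') \Big] \\
&= \sum_w p^\pi(w) \sum_a p^\pi(a|w) R(w, a) + \gamma \sum_w p^\pi(w) \sum_{w'} p^\pi(w'|w) V^\pi(w') \\
&= \mathcal{R}(\pi) + \gamma \sum_{w'} p^\pi(w') V^\pi(w').
\end{align*}

Note that $\mathcal{R}^\gamma_\mu(\pi) = \sum_w \mu(w) V^\pi(w)$. Hence if two policies $\pi, \pi'$ satisfy $V^{\pi'}(w) \geq V^\pi(w)$,
for all $w$, then $\mathcal{R}^\gamma_\mu(\pi') \geq \mathcal{R}^\gamma_\mu(\pi)$, for all $\mu$. However, the same hypothesis does not necessarily
imply any particular relation between $\mathcal{R}(\pi') = (1 - \gamma) \sum_w p^{\pi'}(w) V^{\pi'}(w)$ and $\mathcal{R}(\pi) = (1 -
\gamma) \sum_w p^\pi(w) V^\pi(w)$.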
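
The relation between the discounted values and the average reward is easy to confirm on a small chain. The sketch below works directly with the world state chain induced by a fixed policy: the matrix `P` plays the role of $p^\pi(w'|w)$ and the vector `r` the role of $\sum_a p^\pi(a|w) R(w, a)$. Both are made-up numbers, and the stationary distribution and value function are obtained by plain iteration.

```python
GAMMA = 0.8                                        # assumed discount factor
P = [[0.5, 0.5, 0.0],                              # p^pi(w'|w), made-up, rows sum to 1
     [0.2, 0.3, 0.5],
     [0.6, 0.0, 0.4]]
r = [1.0, 0.0, 2.0]                                # expected one-step reward per world state

def stationary(P, sweeps=5000):
    """Power iteration for the stationary distribution of a primitive chain."""
    n = len(P)
    p = [1.0 / n] * n
    for _ in range(sweeps):
        p = [sum(p[w] * P[w][v] for w in range(n)) for v in range(n)]
    return p

def discounted_values(P, r, gamma, sweeps=5000):
    """Fixed-point iteration of V = r + gamma * P V."""
    n = len(P)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [r[w] + gamma * sum(P[w][v] * V[v] for v in range(n)) for w in range(n)]
    return V

p = stationary(P)
V = discounted_values(P, r, GAMMA)
average_reward = sum(p[w] * r[w] for w in range(3))
lhs = sum(p[w] * V[w] for w in range(3))           # sum_w p(w) V(w)
rhs = average_reward / (1 - GAMMA)                 # R(pi) / (1 - gamma)
print(lhs, rhs)
```

The identity holds only when the values are weighted by the stationary distribution of the same policy, which is why a pointwise improvement of $V^\pi$ does not by itself order the average rewards of two policies.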

Appendix D. Examples with analytic solutions 

We discuss three examples where it is possible to compute the optimal memoryless policy analytically
and show that it has stochasticity equal to the upper bound established in the main text. This proves
the optimality of the stochasticity bound. The first two examples consider the case $|U| = |S| = 1$
and the third example the case with arbitrarily large $|U|$.

Example 17. Consider a POMDP where the agent has only one sensor state, $K$ possible actions,
and the world state transitions are as shown in Figure 6. At each world state only one action takes
the agent further to the right, while all other actions take it to $w = 0$ with probability one. At the
world state $w = K$, the agent receives a reward and all actions take it to $w = 0$. We write
$\pi_a = \pi(a|s = 1)$ and $p_w = p^\pi(w)$. The world state transition matrix is given by

\[ (p^\pi(w'|w))_{w,w'} = \begin{bmatrix}
1 - \pi_1 & \pi_1 & & & & \\
1 - \pi_2 & & \pi_2 & & & \\
1 - \pi_3 & & & \pi_3 & & \\
\vdots & & & & \ddots & \\
1 - \pi_K & & & & & \pi_K \\
1 & & & & &
\end{bmatrix}. \]

Figure 6: State transitions from Example 17. The number in each node indicates the world state.
The sensor state is always $s = 1$. At each world state there is only one action that takes
the agent further to the right, while all other actions take it to $w = 0$ with probability one.


For this system the expected reward per time step is $\mathcal{R}(\pi) = p^\pi(w = K)$. The stationary
distribution of this transition matrix satisfies

\[ p_1 = p_0 \pi_1, \quad p_2 = p_1 \pi_2, \quad \ldots, \quad p_K = p_{K-1} \pi_K. \]

Using the relation

\[ 1 = p_0 + p_1 + \cdots + p_K = p_0 (1 + \pi_1 + \pi_1 \pi_2 + \cdots + \pi_1 \cdots \pi_K), \]

we obtain

\[ p_K = \frac{\pi_1 \cdots \pi_K}{1 + \pi_1 + \pi_1 \pi_2 + \cdots + \pi_1 \cdots \pi_K}. \tag{9} \]

This is positive if and only if $\pi_1, \ldots, \pi_K$ are all larger than zero. In turn, any optimal memoryless
stochastic policy has at least $K$ positive probability actions at the single observation $s = 1$. The
next proposition describes the precise form of the optimal memoryless policy (in this case unique).


Proposition 18. The optimal memoryless policy of the POMDP Example 17 is given by

\begin{align*}
\pi_1 &= c, \\
\pi_i &= \pi_{i-1} + c \, \pi_1 \cdots \pi_{i-1}, \quad i = 2, \ldots, K,
\end{align*}

where $c$ is the unique real positive solution of

\[ \pi_1 + \cdots + \pi_K = 1. \tag{10} \]

Proof of Proposition 18. The policy that maximizes $p_K$ can be found using the method of Lagrange
multipliers. The critical points satisfy $1 - \sum_i \pi_i = 0$ and $\frac{\partial p_K}{\partial \pi_i} - \lambda = 0$ for all $i = 1, \ldots, K$.

Computing the derivatives of (9) we find that

\[ \frac{\pi_1 \cdots \pi_K}{1 + \pi_1 + \cdots + \pi_1 \cdots \pi_K} \left( 1 - \frac{\pi_1 \cdots \pi_i + \cdots + \pi_1 \cdots \pi_K}{1 + \pi_1 + \cdots + \pi_1 \cdots \pi_K} \right) \frac{1}{\pi_i} = \lambda, \quad \text{for } i = 1, \ldots, K. \]

This implies that

\[ \pi_i = c \, (1 + \pi_1 + \cdots + \pi_1 \cdots \pi_{i-1}), \quad \text{for } i = 1, \ldots, K, \]

where $c = \lambda^{-1} \frac{\pi_1 \cdots \pi_K}{(1 + \pi_1 + \cdots + \pi_1 \cdots \pi_K)^2}$.

Note that (10), as a polynomial in $c$, has only positive coefficients, except for the zero degree
coefficient, which is $-1$. By Descartes' rule of signs, a polynomial with coefficients of this form has
exactly one real positive root. □


Interestingly, the optimal policy of Example 17 does not assign uniform probabilities to all
actions; it satisfies $\pi_1 < \cdots < \pi_K$. The interpretation for this is that, when the agent has covered a
longer distance toward the reward position $K$, the cost of being transported back to the start position
before reaching the reward increases. Hence the policy assigns more probability mass to the 'right'
actions at positions closer to the reward position. This effect is less pronounced for larger values of
$K$, for which the optimal policy is more uniform. For illustration we solved the polynomial (10)
numerically for different values of $K$. The results are shown in Table 1.


K    $\pi = (\pi_1, \ldots, \pi_K)$        $\mathcal{R} = p_K$
1    (1)                                   0.5
2    (0.4142, 0.5858)                      0.1464
3    (0.2744, 0.3496, 0.3760)              0.0256
4    (0.2104, 0.2547, 0.2659, 0.2689)      0.0030

Table 1: Optimal memoryless policies obtained from Proposition 18 for the POMDP Example 17,
for $K = 1, \ldots, 4$.
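
The entries of Table 1 can be recomputed from Proposition 18 and equation (9). The sketch below is one possible way to do this and is not necessarily how the table was produced: it builds the policy from a candidate value of $c$ and finds the root of (10) by bisection, which is justified because the left-hand side of (10) is increasing in $c$ on $(0, 1]$. The function names are illustrative.

```python
def policy_from_c(c, K):
    """Build (pi_1, ..., pi_K) from the recursion of Proposition 18."""
    pi = [c]
    prod = c                       # running product pi_1 * ... * pi_{i-1}
    for _ in range(1, K):
        pi.append(pi[-1] + c * prod)
        prod *= pi[-1]
    return pi

def solve_policy(K, tol=1e-12):
    """Bisection on c in (0, 1] for the normalization pi_1 + ... + pi_K = 1."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(policy_from_c(mid, K)) < 1.0:
            lo = mid
        else:
            hi = mid
    return policy_from_c(0.5 * (lo + hi), K)

def reward(pi):
    """Equation (9): p_K = pi_1...pi_K / (1 + pi_1 + pi_1 pi_2 + ... + pi_1...pi_K)."""
    denom, prod = 1.0, 1.0
    for x in pi:
        prod *= x
        denom += prod
    return prod / denom

for K in range(1, 5):
    pi = solve_policy(K)
    print(K, [round(x, 4) for x in pi], round(reward(pi), 4))
```

For $K = 2$ the normalization reduces to $c^2 + 2c - 1 = 0$, so $c = \sqrt{2} - 1 \approx 0.4142$, which matches the second row of the table.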


Example 19. We consider a slight generalization of Example 17. Instead of fully deterministic
transitions $p(w'|w, a)$ we now assume that at each $w = 0, \ldots, K - 1$ action $a = w + 1$ takes the
agent to $w + 1$ with probability $t_{w+1} \in (0, 1]$ and to $w = 0$ with probability $1 - t_{w+1}$. The
world state transition matrix is given by

\[ (p^\pi(w'|w))_{w,w'} = \begin{bmatrix}
1 - t_1 \pi_1 & t_1 \pi_1 & & & & \\
1 - t_2 \pi_2 & & t_2 \pi_2 & & & \\
1 - t_3 \pi_3 & & & t_3 \pi_3 & & \\
\vdots & & & & \ddots & \\
1 - t_K \pi_K & & & & & t_K \pi_K \\
1 & & & & &
\end{bmatrix}. \]

For this system the expected reward per time step is $\mathcal{R}(\pi) = p^\pi(w = K) = p_K$. Similar to (9) we
find that

\[ p_K = \frac{t_1 \pi_1 \cdots t_K \pi_K}{1 + t_1 \pi_1 + \cdots + t_1 \pi_1 \cdots t_K \pi_K}. \]

In close analogy to Proposition 18 we obtain the following description of the optimal policies.




Proposition 20. The optimal memoryless policy of the POMDP Example 19 is given by

\begin{align*}
\pi_1 &= c, \\
\pi_i &= \pi_{i-1} + c \, t_1 \pi_1 \cdots t_{i-1} \pi_{i-1}, \quad i = 2, \ldots, K,
\end{align*}

where $c$ is the unique real positive solution of

\[ \pi_1 + \cdots + \pi_K = 1. \]
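
As a sanity check, the policy described in Proposition 20 can be compared with a brute-force search over the simplex. The sketch below does this for small $K$; the transition probabilities passed as `t` are arbitrary illustrative values, the bisection solver is one convenient choice among many, and the function names are not taken from the paper.

```python
def policy_from_c(c, t):
    """Build (pi_1, ..., pi_K) from the recursion of Proposition 20."""
    pi = [c]
    prod = t[0] * c                # running product t_1 pi_1 ... t_{i-1} pi_{i-1}
    for i in range(1, len(t)):
        pi.append(pi[-1] + c * prod)
        prod *= t[i] * pi[-1]
    return pi

def solve_policy(t, tol=1e-12):
    """Bisection on c in (0, 1] for the normalization pi_1 + ... + pi_K = 1."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(policy_from_c(mid, t)) < 1.0:
            lo = mid
        else:
            hi = mid
    return policy_from_c(0.5 * (lo + hi), t)

def reward(pi, t):
    """p_K = t_1 pi_1 ... t_K pi_K / (1 + t_1 pi_1 + ... + t_1 pi_1 ... t_K pi_K)."""
    denom, prod = 1.0, 1.0
    for ti, x in zip(t, pi):
        prod *= ti * x
        denom += prod
    return prod / denom

def grid_best(t, steps=200):
    """Best reward over a regular grid on the simplex, for K = 2 only."""
    return max(reward([i / steps, 1 - i / steps], t) for i in range(steps + 1))

t = [0.5, 1.0]                     # illustrative transition probabilities
pi_star = solve_policy(t)
print(pi_star, reward(pi_star, t), grid_best(t))
```

Note that $t_K$ does not enter the recursion: it rescales the reward but does not change which policy is optimal.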

The next proposition shows that, in general, for this type of example, the optimal policy cannot
be written as a small convex combination of deterministic policies.

Proposition 21. There is a choice of $t_1, \ldots, t_K$ for which the optimal memoryless policy of the
POMDP Example 19 cannot be written as a convex combination of $K - 1$ deterministic policies.

Proof of Proposition 21. We show that the set of optimal memoryless policies described in
Proposition 20, for all $t_1, \ldots, t_K$, is not contained in any finite union of $K - 2$ dimensional affine spaces.

Consider the expression $\pi_1 + \cdots + \pi_K$, where $\pi_1 = c$ and $\pi_i = \pi_{i-1} + c \, t_1 \pi_1 \cdots t_{i-1} \pi_{i-1}$.
We view this as a polynomial in $c$ with coefficients depending on $t_1, \ldots, t_K$. The derivative with
respect to $t_{K-1}$ is non-zero (as soon as $K \geq 2$ and $c \neq 0$). Hence the solution of $\pi_1 + \cdots + \pi_K = 1$
is a non-constant function $c$ of $t_{K-1}$.

Consider the set of optimal policies for a fixed choice of $t_1, \ldots, t_{K-2}$ and an interval $T \subseteq (0, 1]$
of values of $t_{K-1}$. This is given by

\[ (f_1(c), f_2(c), \ldots, f_{2^{K-1}}(c), f_{2^K}(c) \, t_{K-1}), \quad \text{for all } t_{K-1} \in T, \]

where $c$ is a non-constant function of $t_{K-1}$, and $f_j$ is a polynomial of degree $j$ in $c$ with coefficients
depending on $t_1, \ldots, t_{K-2}$. The restriction of these vectors to the first $K - 1$ coordinates is

\[ (f_1(c), f_2(c), \ldots, f_{2^{K-1}}(c)), \quad \text{for all } c \in C, \]

where $C = \{ c(t_{K-1}) : t_{K-1} \in T \} \subseteq \mathbb{R}$ is an interval with non-empty interior. This set is a
linear projection of the interval $\{ (c, c^2, \ldots, c^{2^K}) : c \in C \}$ of the moment curve in $2^K$-dimensional
Euclidean space, by the matrix $M$ with entry $M_{j,i}$ equal to the degree-$i$ coefficient of $f_j$, for all
$j = 1, \ldots, K - 1$ and $i = 1, \ldots, 2^K$. This matrix is full rank, since each of the $f_j$ has different
degree.

It is well known that each hyperplane intersects a moment curve at most at finitely many points.
Since our linear projection is full rank, the smallest affine space containing infinitely many of its
points is equal to the ambient space $\mathbb{R}^{K-1}$. In turn, no finite union of convex hulls of $K - 1$ policies
contains the set of optimal policies for all $t_{K-1} \in T$. □

Example 22. Consider a POMDP where the agent has $U$ sensor states, $K$ possible actions, and the
world state transitions are as shown in Figure 7. The world state transition matrix is given by

\[ (p^\pi(w'|w))_{w,w'} = \begin{bmatrix}
1 - t_{11} \pi_{11} & t_{11} \pi_{11} & & & & & \\
\vdots & & \ddots & & & & \\
1 - t_{1K} \pi_{1K} & & & t_{1K} \pi_{1K} & & & \\
1 - t_{21} \pi_{21} & & & & t_{21} \pi_{21} & & \\
\vdots & & & & & \ddots & \\
1 - t_{UK} \pi_{UK} & & & & & & t_{UK} \pi_{UK} \\
1 & & & & & &
\end{bmatrix}. \]

Figure 7: State transitions from Example 22. The number in each node indicates the world state.
The type of circle indicates the sensation of the agent (single stroke stands for $s = 1$,
double stroke for $s = 2$, etc.). At each world state exactly one action takes the agent
further, while all other actions take it back to $w = 0$ (arrows omitted for clarity). At $w_{UK}$
the agent receives a reward of one and is invariably taken to $w_{10}$.


The next proposition describes the optimal memoryless policy. 

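Proposition 23. The optimal memoryless policy is given by

\begin{align*}
\pi_{j1} &= c_j d_j, \\
\pi_{ji} &= \pi_{j,i-1} + c_j e_j \, t_{j1} \pi_{j1} \cdots t_{j,i-1} \pi_{j,i-1}, \quad i = 2, \ldots, K,
\end{align*}

where

\begin{align*}
d_j &= 1 + t_{11} \pi_{11} + \cdots + t_{11} \pi_{11} \cdots t_{j-1,K} \pi_{j-1,K}, \\
e_j &= t_{11} \pi_{11} \cdots t_{j-1,K} \pi_{j-1,K},
\end{align*}

and $c_j$ is the unique real positive solution of

\[ \pi_{j1} + \cdots + \pi_{jK} = 1, \]

for $j = 1, \ldots, U$. Here empty products are defined as 1 and empty sums as 0.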

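Proof of Proposition 23. After some algebra, similar to (9), one finds that the last entry of the
stationary world state distribution is given by

\[ p^\pi(w_{UK}) = \frac{t_{11} \pi_{11} \cdots t_{UK} \pi_{UK}}{1 + t_{11} \pi_{11} + \cdots + t_{11} \pi_{11} \cdots t_{UK} \pi_{UK}}. \]

We can maximize this with respect to $\pi$ using the method of Lagrange multipliers. This yields the
following conditions:

\begin{align*}
1 - \sum_i \pi_{ji} &= 0, \quad \text{for all } j = 1, \ldots, U, \\
\frac{\partial p^\pi(w_{UK})}{\partial \pi_{ji}} - \lambda_j &= 0, \quad \text{for all } i = 1, \ldots, K \text{ and } j = 1, \ldots, U.
\end{align*}

From this we obtain

\[ \lambda_j = \frac{1}{\pi_{ji}} \, \frac{t_{11} \pi_{11} \cdots t_{UK} \pi_{UK}}{1 + t_{11} \pi_{11} + \cdots + t_{11} \pi_{11} \cdots t_{UK} \pi_{UK}} \left( 1 - \frac{t_{11} \pi_{11} \cdots t_{ji} \pi_{ji} + \cdots + t_{11} \pi_{11} \cdots t_{UK} \pi_{UK}}{1 + t_{11} \pi_{11} + \cdots + t_{11} \pi_{11} \cdots t_{UK} \pi_{UK}} \right), \]

for all $i = 1, \ldots, K$ and $j = 1, \ldots, U$. This implies

\[ \pi_{ji} = c_j \, (1 + t_{11} \pi_{11} + \cdots + t_{11} \pi_{11} \cdots t_{j,i-1} \pi_{j,i-1}), \quad \text{for all } i = 1, \ldots, K \text{ and } j = 1, \ldots, U, \]

where $c_j = \lambda_j^{-1} \frac{t_{11} \pi_{11} \cdots t_{UK} \pi_{UK}}{(1 + t_{11} \pi_{11} + \cdots + t_{11} \pi_{11} \cdots t_{UK} \pi_{UK})^2}$.

□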


For each sensor state the optimal policy has $K$ positive probability actions. In particular, the
smallest face of $\Delta_{S,A}$ which contains the optimal policy has dimension $|U|(|A| - 1)$.